16 research outputs found

    Online estimation of discrete densities using classifier chains

    We propose an approach to estimate a discrete joint density online, that is, the algorithm is provided only with the current example, its current estimate, and a limited amount of memory. To design an online estimator for discrete densities, we use classifier chains to model dependencies among features. Each classifier in the chain estimates the probability of one particular feature. Because a single chain may not provide a reliable estimate, we also consider ensembles of classifier chains. Our experiments on synthetic data show that the approach is feasible and that the estimated densities approach the true, known distribution with increasing amounts of data.
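    The chain idea in this abstract can be illustrated with a minimal sketch. It assumes that each chain link predicts one feature from all preceding features, so the joint density is the product of the link probabilities (the chain rule). The class names `CountingClassifier` and `ChainDensityEstimator` are hypothetical, and the smoothed count table is only a stand-in for whatever online classifier the paper actually uses. Ensembles of chains are omitted.

```python
from collections import defaultdict


class CountingClassifier:
    """Stand-in for an online classifier (assumption): Laplace-smoothed counts
    of the target value, kept separately for every observed context."""

    def __init__(self, n_values, alpha=1.0):
        self.n_values = n_values  # size of the target feature's domain
        self.alpha = alpha        # smoothing pseudo-count
        self.counts = defaultdict(lambda: [0] * n_values)

    def update(self, context, value):
        self.counts[context][value] += 1

    def prob(self, context, value):
        # Use .get so that querying an unseen context does not create an entry.
        counts = self.counts.get(context, [0] * self.n_values)
        total = sum(counts)
        return (counts[value] + self.alpha) / (total + self.alpha * self.n_values)


class ChainDensityEstimator:
    """Joint density as a product of per-feature conditionals:
    p(x) = p(x_0) * p(x_1 | x_0) * ... * p(x_{d-1} | x_0, ..., x_{d-2})."""

    def __init__(self, domain_sizes, alpha=1.0):
        self.links = [CountingClassifier(k, alpha) for k in domain_sizes]

    def update(self, x):
        # Online update: only the current example is needed.
        for i, link in enumerate(self.links):
            link.update(tuple(x[:i]), x[i])

    def density(self, x):
        p = 1.0
        for i, link in enumerate(self.links):
            p *= link.prob(tuple(x[:i]), x[i])
        return p
```

    Because every link returns a normalized conditional distribution, the product sums to one over the whole discrete domain. Note that this count table grows with the number of distinct contexts, so it does not respect the limited-memory setting the abstract describes.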

    Online density estimates : a probabilistic condensed representation of data for knowledge discovery

    The Internet of Things (IoT) and the data generated by its sensors are placing new demands on data mining methods. These demands stem from the desire to benefit from the knowledge contained in this data and from the increasing number of devices that are equipped with these sensors. According to companies like Intel or HP, the number of sensors worldwide is likely to reach more than one trillion by 2022. All of them will produce streams of measurements, and leveraging knowledge from these streams requires infrastructure to analyze them in real time. From a data mining perspective, this involves challenging tasks such as cleaning the data, handling large amounts of data, and preserving their privacy, to name a few. The state of the art in data mining has already addressed some of these challenges, but the proposed methods are typically designed for a specific task (e.g., predicting a certain variable or finding frequent patterns) and perform this task while scanning the data stream. However, at the time of collecting the data, it is often not known what kind of analysis needs to be performed, or there are several, possibly even dependent, analysis tasks. This means that whenever storing the original data is either not feasible due to the sheer volume or impossible due to privacy concerns, the user has to wait for more data to initiate another analysis task, which impedes the use of conventional data mining algorithms. Therefore, we present a framework in this thesis, called MiDEO (Mining Density Estimates inferred Online), which decouples the process of collecting the data from the actual analysis. It uses density estimates to maintain a compact representation of the data stream and provides inference capabilities to perform queries on them. The queries can be combined into complex data mining tasks and allow the estimates to be adapted to the current needs of the user or the algorithm.
Compared to current methods that typically focus on one task at a time, this enables a more interactive analysis of the data stream, where the task selection is part of the analysis. In the course of designing such a framework, we develop several methods to improve the state of the art. This includes online density estimators for conditional joint densities with mixed types of variables, an online density estimator for high-dimensional data, algorithms to perform pattern mining on online density estimates, an online density estimator that is able to represent recurrences in the data stream, and algorithms that enforce well-known privacy-preserving properties to protect the entities described by the data. To show the effectiveness of these methods, we prove some of their theoretical properties and perform an extensive set of experiments.

    Privacy-preserving pattern mining on online density estimates

    Traditional pattern mining algorithms require access to the data, either in the form of a complete set of data, as in batch data mining, or in the form of a window of recent data, as in stream mining. In the case of stream mining, this comes with a number of disadvantages, such as the possibly unbounded growth of relevant instances, drift, possibly changing data mining tasks, and issues with privacy, to name a few. Therefore, an approach has recently been proposed that extracts patterns just from statistical information about the stream, more precisely from an online density estimate that is inferred from it. As this approach is mainly based on sampling from the density estimates, it still struggles with itemsets having a medium to low frequency. To resolve this issue, we pursue an alternative strategy in this paper and directly exploit the structure of the density estimates to extract frequent itemsets. Additionally, we address the important matter of privacy-preserving data mining by ensuring that the density estimate fulfills privacy-related properties. To show the effectiveness of the proposed methods, we provide proofs and evaluate the performance on synthetic and real-world data.

    A probabilistic condensed representation of data for stream mining

    Data mining and machine learning algorithms usually operate directly on the data. However, if the data is not available at once or consists of billions of instances, these algorithms easily become infeasible with respect to memory and run-time concerns. As a solution to this problem, we propose a framework, called MiDEO (Mining Density Estimates inferred Online), in which algorithms are designed to operate on a condensed representation of the data. In particular, we propose to use density estimates, which are able to represent billions of instances in a compact form and can be updated when new instances arrive. As an example of an algorithm that operates on density estimates, we consider the task of mining association rules, which we regard as a form of simple statements about the data. The algorithm, called POEt (Pattern mining on Online density esTimates), is evaluated on synthetic and real-world data and is compared to state-of-the-art algorithms.
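    To make the idea of mining association rules from a density estimate rather than from raw data concrete, here is a brute-force sketch. It is not the POEt algorithm, whose details the abstract does not give. It assumes a density over binary item vectors is available as a callable. The helper names `support` and `confidence` are hypothetical, and enumerating all vectors is only practical for a handful of items.

```python
from itertools import product


def support(density, n_items, itemset):
    """Probability mass of all binary transactions containing every item in
    `itemset`, obtained by marginalizing the density (no raw data needed)."""
    total = 0.0
    for x in product((0, 1), repeat=n_items):
        if all(x[i] == 1 for i in itemset):
            total += density(x)
    return total


def confidence(density, n_items, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent:
    support(antecedent and consequent together) / support(antecedent)."""
    supp_a = support(density, n_items, antecedent)
    if supp_a == 0.0:
        return 0.0  # rule is undefined without mass on the antecedent
    both = set(antecedent) | set(consequent)
    return support(density, n_items, both) / supp_a


# Toy density over two items, given as an explicit table (illustrative values).
table = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}


def toy_density(x):
    return table[x]
```

    With the toy table, `support(toy_density, 2, {0})` sums the mass of the two vectors that contain item 0, and `confidence(toy_density, 2, {0}, {1})` divides the joint mass of both items by it.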

    Modeling recurrent distributions in streams using possible worlds

    Discovering changes in the data distribution of streams and discovering recurrent data distributions are challenging problems in data mining and machine learning. Both have received a lot of attention in the context of classification. With the ever-increasing growth of data, however, there is a high demand for compact and universal representations of data streams that enable the user to analyze current as well as historic data without having access to the raw data. As a first step in this direction, we propose a condensed representation that captures the various, possibly recurrent, data distributions of the stream by extending the notion of possible worlds. The representation enables queries concerning the whole stream and can, hence, serve as a tool for supporting decision-making processes or as a basis for implementing data mining and machine learning algorithms on top of it. We evaluate this condensed representation on synthetic and real-world data.

    Online density estimation of heterogeneous data streams in higher dimensions

    The joint density of a data stream is suitable for performing data mining tasks without having access to the original data. However, the methods proposed so far only target a small to medium number of variables, since their estimates rely on representing all the interdependencies between the variables of the data. High-dimensional data streams, which are becoming more and more frequent due to increasing numbers of interconnected devices, are, therefore, pushing these methods to their limits. To mitigate these limitations, we present an approach that projects the original data stream into a vector space and uses a set of representatives to provide an estimate. Due to the structure of the estimates, it enables the density estimation of higher-dimensional data and approaches the true density with increasing dimensionality of the vector space. Moreover, it is not only designed to estimate homogeneous data, i.e., where all variables are nominal or all variables are numeric, but it can also estimate heterogeneous data. The evaluation is conducted on synthetic and real-world data. The software related to this paper is available at https://github.com/geilke/mideo
